All Questions
Tagged with scala apache-spark
12 questions
1 vote
1 answer
156 views
Group events close in time into sessions and assign unique session IDs
The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve. Given is a DataFrame with events, each with a user ID and a timestamp. ...
4 votes
1 answer
196 views
Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions
I need help rewriting my code to be less repetitive. I am used to procedural rather than object-oriented coding. My Scala program is for Databricks. How would you combine cmd 3 and 5 together? Does ...
2 votes
0 answers
1k views
Spark Scala: SQL rlike vs Custom UDF
I've a scenario where 10K+ regular expressions are stored in a table along with various other columns and this needs to be joined against an incoming dataset. Initially I was using "spark sql rlike" ...
2 votes
2 answers
801 views
Scala app to transpose columns into rows
This is the first application, or really any Scala, I have ever written. So far it functions as I would hope it would. I just found this community and would love some peer review of possible ...
1 vote
1 answer
5k views
Joining Apache Spark data frames, with many conditional substitutions
I am joining two data frames in Spark using Scala. My code looks very ugly because of the multiple when conditions. Can somebody please help me simplify my code? Here is my existing code. ...
3 votes
0 answers
2k views
Apache Spark compaction script to handle small files in HDFS
I have some use cases where I have small Parquet files in Hadoop, say, 10-100 MB. I would like to compact them so as to have files of at least, say, 100 MB or 200 MB. The logic of my code is to: * find a ...
3 votes
0 answers
2k views
Adding columns in Spark dataframe based on rules
I have a DataFrame df, which contains the data below: ...
2 votes
0 answers
176 views
Reduce sample rate of GPS data based on distance between points
The algorithm needs to reduce an RDD[GPSRecord] based on the distance between several points, e.g. "give me only GPS records when the distance between them exceeds ...
3 votes
1 answer
115 views
Classifying and counting database entries using Scala map and flatMap
I am new to Spark and Scala and I have solved the following problem. I have a table in a database with the following structure: ...
0 votes
1 answer
6k views
Unit testing Spark transformation on DataFrame
Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data and passes it to a transformation, then makes assertions on the ...
5 votes
0 answers
718 views
RandomForest multi-class classification
Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
5 votes
1 answer
2k views
Why does the LR on Spark run so slowly?
Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on Spark clusters. The settings are: 5 nodes, each node with 8 cores (all the ...